In this project we explore white wine quality using a dataset of 4898 samples and 13 variables: an index column (X), 11 chemical measurements, and a quality score.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
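The three outputs above look like the results of `names()`, `str()` and `summary()`. A minimal sketch of those calls follows, assuming a small data frame named `wine_head` (a hypothetical name) built from the first five rows shown in the `str()` output. The file name of the full dataset is not given in this report, so no file is read here.

```r
# First five rows, copied from the str() output above. A real analysis
# would read the full file with read.csv(); the path is not stated in
# this report, so it is left out here.
wine_head <- data.frame(
  X = 1:5,
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23, 0.23),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

names(wine_head)    # column names
str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, median, mean and max per column
```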
Observed quality scores range from 3 to 9 and are roughly normally distributed around a median of 6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed acidity is approximately normally distributed with a mean of 6.855. The first plot shows outliers above 10, while half of the data lies between 6.3 and 7.3, so the second plot was made to exclude the outliers.
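One common way to produce a "second plot" without extreme values is to cut the data at an upper quantile. The sketch below assumes a 99% cutoff and uses synthetic data. The helper `trim_upper` is hypothetical, and the report does not state which cutoff was actually used.

```r
# Hypothetical helper: keep values at or below the given upper quantile.
trim_upper <- function(x, prob = 0.99) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

set.seed(1)
# Synthetic stand-in for fixed acidity: a bell-shaped bulk plus two
# extreme values.
acidity <- c(rnorm(500, mean = 6.855, sd = 0.8), 13.5, 14.2)
trimmed <- trim_upper(acidity)

par(mfrow = c(1, 2))
hist(acidity, breaks = 30, main = "All values", xlab = "fixed acidity")
hist(trimmed, breaks = 30, main = "Top 1% removed",
     xlab = "fixed acidity")
```

The same helper could be applied to the other variables that are described as having a few outliers.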
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile acidity ranges from 0.08 to 1.10 with a long right tail and a mean of 0.278. To understand the tail better, the plot was transformed, which discards a few outliers and spreads the bars out.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Alcohol has a skewed distribution: the most common alcohol level is around 9.5, and the mean is 10.51.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates are well distributed with a median of 0.47 (mean 0.49). There are a few outliers, which are not a big concern and were removed in the next plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density is approximately normally distributed with a median of 0.9937. The maximum value of 1.039 looks like an outlier, and the second plot excludes it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
pH is also approximately normally distributed with a median of 3.18, and there do not appear to be any outliers, even in the second plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The distribution of residual sugar does not look normal, and there is a spike at 2. There is also an outlier at 65.8.
The second plot looks cleaner, shows a bimodal distribution, and has no visible outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The chlorides distribution looks normal apart from an outlier; the second plot excludes the outlier and shows a cleaner distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid is approximately normally distributed, but there is an outlier. Most wines have a citric acid level around 0.3, and there is also a spike at 0.49.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Free sulfur dioxide is similar to chlorides: approximately normal, with outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide is also approximately normally distributed, with an outlier at 440 that is excluded in the second plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.110 6.570 7.070 7.133 7.590 14.470
Here are the combined plots.
The two plots above combine the counts of all features in a single view, with and without the modifications.
The white wine dataset contains 4898 samples with 12 features: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality. The first 11 are chemical properties on which the quality of the wine may directly or indirectly depend.
Quality scores in the data range from 3 to 9.
All of the univariate distributions above are either approximately normal or skewed, with a few outliers.
The main feature of the dataset is wine quality. Because quality depends on several other features in the dataset, we need to explore them as well.
According to my online research, the quality score is based on sensory data, while the remaining features are chemical properties of the wines, including density, acidity and alcohol content.
Yes. I created total.acidity, which is the sum of volatile and fixed acidity.
Yes, I performed a few additional operations to exclude outliers. In addition, I noticed an unusual distribution in residual sugar, so I applied a log10 transformation, after which the data looks bimodal. The rest of the features are either approximately normal or skewed.
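The derived variable can be sketched as a simple column sum. The rows below are the first four from the `str()` output, and the data frame name `wine` is an assumption. As a sanity check on the full data, the reported means agree: 6.855 + 0.2782 is about 7.133, the mean in the total acidity summary.

```r
# First four rows of the two acidity columns, copied from the str()
# output earlier in the report.
wine <- data.frame(
  fixed.acidity = c(7, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23)
)

# New feature: total acidity as the sum of fixed and volatile acidity.
wine$total.acidity <- with(wine, fixed.acidity + volatile.acidity)
wine
```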
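To illustrate why a log10 transformation can reveal two modes that a raw histogram hides, here is a small sketch on synthetic data. The mixture of two lognormal groups is an assumption chosen for illustration; it is not the real residual sugar data.

```r
set.seed(2)
# Synthetic stand-in for residual sugar: a mix of two lognormal groups,
# one near 1.5 and one near 10 (illustrative values only).
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10), sdlog = 0.4))

par(mfrow = c(1, 2))
hist(sugar, breaks = 40, main = "Raw scale", xlab = "residual sugar")
hist(log10(sugar), breaks = 40, main = "log10 scale",
     xlab = "log10(residual sugar)")
```

On the raw scale the upper group is stretched into a long tail, while on the log10 scale both groups show up as separate peaks.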
ScatterPlot Matrix:
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.25581431 0.002857966
## fixed.acidity -0.255814305 1.00000000 -0.022697290
## volatile.acidity 0.002857966 -0.02269729 1.000000000
## citric.acid -0.149899918 0.28918070 -0.149471811
## residual.sugar 0.006623775 0.08902070 0.064286060
## chlorides -0.045645192 0.02308564 0.070511571
## free.sulfur.dioxide -0.011928911 -0.04939586 -0.097011939
## total.sulfur.dioxide -0.161979037 0.09106976 0.089260504
## density -0.185976097 0.26533101 0.027113845
## pH -0.115774132 -0.42585829 -0.031915368
## sulphates 0.009807759 -0.01714299 -0.035728147
## alcohol 0.213656245 -0.12088112 0.067717943
## quality 0.035763247 -0.11366283 -0.194722969
## total.acidity -0.254350594 0.99290766 0.096321153
## citric.acid residual.sugar chlorides
## X -0.149899918 0.006623775 -0.04564519
## fixed.acidity 0.289180698 0.089020701 0.02308564
## volatile.acidity -0.149471811 0.064286060 0.07051157
## citric.acid 1.000000000 0.094211624 0.11436445
## residual.sugar 0.094211624 1.000000000 0.08868454
## chlorides 0.114364448 0.088684536 1.00000000
## free.sulfur.dioxide 0.094077221 0.299098354 0.10139235
## total.sulfur.dioxide 0.121130798 0.401439311 0.19891030
## density 0.149502571 0.838966455 0.25721132
## pH -0.163748211 -0.194133454 -0.09043946
## sulphates 0.062330940 -0.026664366 0.01676288
## alcohol -0.075728730 -0.450631222 -0.36018871
## quality -0.009209091 -0.097576829 -0.20993441
## total.acidity 0.270135269 0.096274432 0.03136937
## free.sulfur.dioxide total.sulfur.dioxide density
## X -0.0119289106 -0.161979037 -0.18597610
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## volatile.acidity -0.0970119393 0.089260504 0.02711385
## citric.acid 0.0940772210 0.121130798 0.14950257
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## sulphates 0.0592172458 0.134562367 0.07449315
## alcohol -0.2501039415 -0.448892102 -0.78013762
## quality 0.0081580671 -0.174737218 -0.30712331
## total.acidity -0.0607153894 0.101284413 0.26738970
## pH sulphates alcohol quality
## X -0.1157741316 0.009807759 0.21365624 0.035763247
## fixed.acidity -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity -0.0319153683 -0.035728147 0.06771794 -0.194722969
## citric.acid -0.1637482114 0.062330940 -0.07572873 -0.009209091
## residual.sugar -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides -0.0904394560 0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide -0.0006177961 0.059217246 -0.25010394 0.008158067
## total.sulfur.dioxide 0.0023209718 0.134562367 -0.44889210 -0.174737218
## density -0.0935914935 0.074493149 -0.78013762 -0.307123313
## pH 1.0000000000 0.155951497 0.12143210 0.099427246
## sulphates 0.1559514973 1.000000000 -0.01743277 0.053677877
## alcohol 0.1214320987 -0.017432772 1.00000000 0.435574715
## quality 0.0994272457 0.053677877 0.43557472 1.000000000
## total.acidity -0.4277827423 -0.021316418 -0.11229714 -0.136319694
## total.acidity
## X -0.25435059
## fixed.acidity 0.99290766
## volatile.acidity 0.09632115
## citric.acid 0.27013527
## residual.sugar 0.09627443
## chlorides 0.03136937
## free.sulfur.dioxide -0.06071539
## total.sulfur.dioxide 0.10128441
## density 0.26738970
## pH -0.42778274
## sulphates -0.02131642
## alcohol -0.11229714
## quality -0.13631969
## total.acidity 1.00000000
There are several unexpected correlations between features. Let's drop the weakly correlated features X, volatile acidity, citric acid, sulphates and quality and draw the scatterplot matrix again.
## fixed.acidity residual.sugar chlorides
## fixed.acidity 1.00000000 0.08902070 0.02308564
## residual.sugar 0.08902070 1.00000000 0.08868454
## chlorides 0.02308564 0.08868454 1.00000000
## free.sulfur.dioxide -0.04939586 0.29909835 0.10139235
## total.sulfur.dioxide 0.09106976 0.40143931 0.19891030
## density 0.26533101 0.83896645 0.25721132
## pH -0.42585829 -0.19413345 -0.09043946
## alcohol -0.12088112 -0.45063122 -0.36018871
## total.acidity 0.99290766 0.09627443 0.03136937
## free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity -0.0493958591 0.091069756 0.26533101
## residual.sugar 0.2990983537 0.401439311 0.83896645
## chlorides 0.1013923521 0.198910300 0.25721132
## free.sulfur.dioxide 1.0000000000 0.615500965 0.29421041
## total.sulfur.dioxide 0.6155009650 1.000000000 0.52988132
## density 0.2942104109 0.529881324 1.00000000
## pH -0.0006177961 0.002320972 -0.09359149
## alcohol -0.2501039415 -0.448892102 -0.78013762
## total.acidity -0.0607153894 0.101284413 0.26738970
## pH alcohol total.acidity
## fixed.acidity -0.4258582910 -0.1208811 0.99290766
## residual.sugar -0.1941334540 -0.4506312 0.09627443
## chlorides -0.0904394560 -0.3601887 0.03136937
## free.sulfur.dioxide -0.0006177961 -0.2501039 -0.06071539
## total.sulfur.dioxide 0.0023209718 -0.4488921 0.10128441
## density -0.0935914935 -0.7801376 0.26738970
## pH 1.0000000000 0.1214321 -0.42778274
## alcohol 0.1214320987 1.0000000 -0.11229714
## total.acidity -0.4277827423 -0.1122971 1.00000000
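A reduced correlation matrix like the one above could be computed by dropping columns before calling `cor()`. The sketch below uses a synthetic data frame with a handful of the report's column names. The coefficients used to build `density` are illustrative assumptions, so the numbers will not match the real output.

```r
set.seed(3)
n <- 200
# Synthetic stand-in for the wine data; column names follow the report.
wine <- data.frame(
  X = seq_len(n),
  residual.sugar = runif(n, 1, 20),
  alcohol = runif(n, 8, 14),
  sulphates = runif(n, 0.2, 1),
  quality = sample(3:9, n, replace = TRUE)
)
# Density built to rise with sugar and fall with alcohol, plus noise.
wine$density <- 0.995 + 0.0004 * wine$residual.sugar -
  0.001 * wine$alcohol + rnorm(n, sd = 0.0005)

drop_cols <- c("X", "sulphates", "quality")
kept <- wine[, !(names(wine) %in% drop_cols)]
round(cor(kept), 3)  # Pearson correlation is the default method
```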
My findings on the white wine dataset from the scatterplot matrix above are as follows.
Let's explore the relationships between a few correlated and uncorrelated features using bivariate plots.
The plot above clearly shows a strong relationship between density and residual sugar, consistent with the correlation coefficient of 0.839.
We can also see a strong relationship between density and alcohol, with a correlation coefficient of -0.78.
The plot above shows that the correlation between quality and chlorides is weak (-0.21).
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
## quality: 3
## [1] 1.086
## --------------------------------------------------------
## quality: 4
## [1] 8.166
## --------------------------------------------------------
## quality: 5
## [1] 75.103
## --------------------------------------------------------
## quality: 6
## [1] 99.388
## --------------------------------------------------------
## quality: 7
## [1] 33.608
## --------------------------------------------------------
## quality: 8
## [1] 6.705
## --------------------------------------------------------
## quality: 9
## [1] 0.137
The histogram suggests that quality is better for wines with a medium chloride concentration.
The plot above shows that the correlation between quality and total acidity is weak (-0.136).
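The per-quality blocks above have the layout that `by()` prints. The second set of blocks appears to hold per-quality sums: for quality 9, the total 0.137 is five times the mean 0.0274. A minimal sketch of both calls on a tiny synthetic data frame (the values are made up for illustration):

```r
# Synthetic stand-in: chloride readings for three quality levels.
wine <- data.frame(
  quality = c(5, 5, 5, 6, 6, 7),
  chlorides = c(0.047, 0.040, 0.053, 0.043, 0.036, 0.037)
)

by(wine$chlorides, wine$quality, summary)  # six-number summary per level
by(wine$chlorides, wine$quality, sum)      # total per level

# The same totals as a plain named vector.
totals <- tapply(wine$chlorides, wine$quality, sum)
totals
```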
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.415 6.820 7.705 7.933 8.857 12.030
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.450 6.745 7.310 7.511 7.920 10.910
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.690 6.660 7.140 7.236 7.730 10.550
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.110 6.550 7.030 7.098 7.567 14.470
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.370 6.505 6.980 6.997 7.460 9.450
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.125 6.475 7.040 6.935 7.490 8.570
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.960 7.260 7.360 7.718 7.640 9.370
## quality: 3
## [1] 158.665
## --------------------------------------------------------
## quality: 4
## [1] 1224.24
## --------------------------------------------------------
## quality: 5
## [1] 10542.83
## --------------------------------------------------------
## quality: 6
## [1] 15601.92
## --------------------------------------------------------
## quality: 7
## [1] 6157.785
## --------------------------------------------------------
## quality: 8
## [1] 1213.545
## --------------------------------------------------------
## quality: 9
## [1] 38.59
Quality is better for wines with medium total acidity.
Again, the correlation between quality and density is weak.
[Plot: Quality by density]
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9911 0.9925 0.9944 0.9949 0.9969 1.0001
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9958 1.0004
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0024
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0004
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0006
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9897 0.9898 0.9903 0.9915 0.9906 0.9970
## quality: 3
## [1] 19.89768
## --------------------------------------------------------
## quality: 4
## [1] 162.0671
## --------------------------------------------------------
## quality: 5
## [1] 1450.098
## --------------------------------------------------------
## quality: 6
## [1] 2184.727
## --------------------------------------------------------
## quality: 7
## [1] 873.3581
## --------------------------------------------------------
## quality: 8
## [1] 173.6413
## --------------------------------------------------------
## quality: 9
## [1] 4.9573
Density and quality have a loose negative correlation of -0.307, which is reflected in the boxplot. After removing the top 1% of outliers, the jitter chart shows a downward trend from left to right. The boxplot supports this: as quality increases from 5 to 9, the quartile ranges for density steadily decrease. Higher quality ratings are therefore associated with lower density values.
Quality appears to depend noticeably on alcohol, with a correlation coefficient of 0.436.
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
## quality: 3
## [1] 206.9
## --------------------------------------------------------
## quality: 4
## [1] 1654.85
## --------------------------------------------------------
## quality: 5
## [1] 14291.48
## --------------------------------------------------------
## quality: 6
## [1] 23244.67
## --------------------------------------------------------
## quality: 7
## [1] 10003.78
## --------------------------------------------------------
## quality: 8
## [1] 2036.3
## --------------------------------------------------------
## quality: 9
## [1] 60.9
More white wines are made with an alcohol level around 9. But the histogram shows that quality increases as the alcohol level increases, which is consistent with the correlation coefficient of 0.436.
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The main relationships in this bivariate analysis involve alcohol, which has a strong relationship with density (-0.78) and a moderate one with residual sugar (-0.45).
However, no single strong relationship with quality could be found; the largest correlation is with alcohol (0.436). None of the other features analyzed is strongly related to quality on its own. This is to be expected, because a good wine is not easy to make, is it?
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
The most interesting relationships involve density. Looking at the correlations between features, density almost always has the highest values.
What was the strongest relationship you found?
The strongest relationship is between density and residual sugar, with a correlation of 0.84. Density and alcohol are also strongly correlated (-0.78).
The multi-plot diagram above shows the relationship between density and alcohol for each quality level.
Creating an mtable for the white wine dataset.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol^(1/3)), data = wine)
## m2: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides, data = wine)
## m3: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density,
## data = wine)
## m4: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density +
## pH, data = wine)
## m5: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density +
## pH + sulphates, data = wine)
## m6: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density +
## pH + total.acidity, data = wine)
##
## ========================================================================================================
## m1 m2 m3 m4 m5 m6
## --------------------------------------------------------------------------------------------------------
## (Intercept) -4.065*** -3.442*** -28.461*** -29.003*** -26.493*** -44.608***
## (0.296) (0.327) (6.483) (6.479) (6.501) (6.733)
## I(alcohol^(1/3)) 4.545*** 4.313*** 4.980*** 4.929*** 4.875*** 5.315***
## (0.135) (0.145) (0.225) (0.226) (0.226) (0.230)
## chlorides -2.482*** -2.382*** -2.295*** -2.349*** -2.376***
## (0.559) (0.559) (0.559) (0.558) (0.556)
## density 23.694*** 23.554*** 21.109*** 40.242***
## (6.132) (6.126) (6.149) (6.442)
## pH 0.248** 0.200** -0.047
## (0.076) (0.077) (0.084)
## sulphates 0.398***
## (0.101)
## total.acidity -0.124***
## (0.016)
## --------------------------------------------------------------------------------------------------------
## R-squared 0.187 0.191 0.193 0.195 0.198 0.205
## adj. R-squared 0.187 0.190 0.193 0.194 0.197 0.204
## sigma 0.798 0.797 0.796 0.795 0.794 0.790
## F 1129.793 576.903 390.673 296.255 240.790 252.560
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -5846.130 -5836.294 -5828.835 -5823.494 -5815.779 -5792.251
## Deviance 3120.832 3108.324 3098.870 3092.119 3082.394 3052.922
## AIC 11698.259 11680.589 11667.669 11658.987 11645.558 11598.502
## BIC 11717.749 11706.575 11700.152 11697.967 11691.034 11643.978
## N 4898 4898 4898 4898 4898 4898
## ========================================================================================================
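The table above compares nested `lm()` fits. The sketch below mirrors the first two model formulas (m1 and m2) in base R on synthetic data and compares their R-squared values. The simulated coefficients and noise level are assumptions for illustration only, so the results will differ from the table.

```r
set.seed(4)
n <- 500
# Synthetic stand-in for the wine data; coefficients are illustrative.
wine <- data.frame(
  alcohol = runif(n, 8, 14),
  chlorides = runif(n, 0.01, 0.1)
)
wine$quality <- -4 + 4.5 * wine$alcohol^(1/3) - 2.5 * wine$chlorides +
  rnorm(n, sd = 0.8)

m1 <- lm(quality ~ I(alcohol^(1/3)), data = wine)
m2 <- update(m1, ~ . + chlorides)  # same model plus one more predictor

# R-squared is the share of variance in quality explained by each model.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
round(r2, 3)
```

Adding a predictor to a least-squares model with an intercept can never lower R-squared, which is why the table also reports adjusted R-squared, AIC and BIC.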
As we saw in the bivariate section, density is strongly correlated with residual sugar and alcohol, and this holds at every quality level.
Furthermore, a small relationship appears when total acidity is combined with residual sugar and alcohol. In the linear model, the R-squared value is about 0.2, which means that about 20% of the variance in quality is accounted for.
As noted before, the most interesting feature is density, analyzed together with alcohol and residual sugar. No special interaction could be seen in this section.
The distribution of residual sugar appears to be bimodal. This is not easy to explain; it may reflect demand for clearly differentiated levels of sweetness. There is an official classification of wines by sweetness, but they are almost outliers in this dataset.
[Plot: Histogram of Density with color set by Quality]
This histogram provides a better visualization of the relationship between density and quality. Since there are some outliers for density, I removed the top and bottom 1% from the chart.
Since the color is set by the quality, each quality value has a unique impact on the overall histogram, and I can draw insight from these distributions. The centers of the distributions shift to the left as the color changes to darker blue. This means that the main concentration of density values decreases as the quality increases.
Summarizing the data by quality supports this assertion: The median density steadily decreases from 0.9953 to 0.9903 as the quality value increases from “poor” (5) to “good” (9). This is similar to what I noticed when evaluating the correlation between alcohol and quality in the last plot. I realized that I needed to investigate the relationship between density and alcohol.
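The claim about median density can be checked directly from the per-quality summaries earlier in the report. The sketch below copies those seven medians into a vector (named `median_density` here, a name chosen for illustration) and tests whether they fall at every step from quality 5 to 9.

```r
# Median density by quality, copied from the by() summaries earlier in
# the report. On the full data this would come from something like
# tapply(wine$density, wine$quality, median).
median_density <- c(`3` = 0.9944, `4` = 0.9941, `5` = 0.9953,
                    `6` = 0.9937, `7` = 0.9918, `8` = 0.9916,
                    `9` = 0.9903)

from_five <- median_density[as.character(5:9)]
diff(from_five)           # change from each quality level to the next
all(diff(from_five) < 0)  # TRUE when the median falls at every step
```

Note that the decrease holds from quality 5 upward; the medians for qualities 3 and 4 are lower than the one for quality 5.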
This plot reflects the relationship between density, alcohol, and sugar in a single visualization. I split the residual sugar values into two buckets delineated by the median value of 5.2 in order to see the trends more clearly.
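Splitting a variable into two buckets at its median can be sketched as follows. The data are synthetic, with density constructed to rise with sugar, and the bucket labels "low" and "high" are assumptions; the report only states that the split is at the median of 5.2.

```r
set.seed(5)
n <- 300
# Synthetic stand-in; density is built to rise with sugar (illustrative).
wine <- data.frame(residual.sugar = rexp(n, rate = 1 / 6) + 0.6)
wine$density <- 0.991 + 0.0004 * wine$residual.sugar +
  rnorm(n, sd = 0.0005)

# Two buckets delineated by the median of residual sugar.
cutoff <- median(wine$residual.sugar)
wine$sugar.bucket <- ifelse(wine$residual.sugar <= cutoff, "low", "high")

bucket_counts <- table(wine$sugar.bucket)  # about half in each bucket
bucket_means <- tapply(wine$density, wine$sugar.bucket, mean)
bucket_counts
bucket_means
```

Coloring a scatterplot by `sugar.bucket` then makes it easy to see whether one group sits higher on the density axis than the other.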
I can see that as the alcohol level increases, the density decreases because the scatterplot has a downward trend to the right. This suggests that alcohol is one of the less dense ingredients in wine. Also, the sugar red/blue coloring shows that as the sugar increases, the density also increases, since the blue dots are higher on the chart than the red dots. This suggests that sugar is one of the more dense ingredients in wine. Thirdly, there is a heavier concentration of blue dots on the left side of the chart than the right side, which means that lower alcohol levels are associated with higher levels of sugar. The correlation values between these variables support all of these insights from the chart.
I investigated the wine-making process in order to better understand the relationship between these features. Fermentation converts the sugars to alcohol, so the conclusions from this chart make logical sense. This was interesting to me, because the data helped me understand how wine is created.
The white wine dataset contains information on almost 5000 wines. First, I performed an exploratory data analysis to understand the features, along with some online research to put the topic in context. This gave me some references on how quality could be calculated or predicted from the features already provided in the dataset. Early on, some relationships caught my attention, such as the strong relationship of density with alcohol and residual sugar. Trying to find relationships that determine good quality, however, was quite frustrating. My online research pointed me to this formula: Sweet Taste (sugars + alcohols) <=> Acid Taste (acids). But reaching a conclusion was not as easy as it seemed. I could find a small relationship between these features, but in the resulting linear model only a small share of the variance in quality is accounted for (about 20%).
One conclusion I can draw is that the dataset lacks a wider spread of quality values. Almost all the wines are 'NORMAL', which makes them difficult to cluster. I also think my analysis was somewhat biased by trying to predict quality using the formula above.
In a next iteration or further analysis, the first thing to look into is the strange peak seen in the citric acid histogram. Another possible direction is to include other features in the final model, to try to increase the share of variance accounted for.
Feature Knowledge
acidity http://www.calwineries.com/learn/wine-chemistry/acidity http://winemakersacademy.com/understanding-wine-acidity/
volatile acidity http://extension.psu.edu/food/enology/wine-production/volatile-acidity-in-wine
citric acid http://www.calwineries.com/learn/wine-chemistry/wine-acids/citric-acid
residual sugar http://www.calwineries.com/learn/wine-chemistry/sugar-in-wine
alcohol http://www.calwineries.com/learn/wine-chemistry/alcohol
Some thoughts